Data Pre-processing refers to add, delete or transform features of the trainning set. Different models have different sensibility of the predictors values, how the predictors enter the model is important. Transformations to reduce skewness or outliers can improve performance, combine two or more predictors, removing predictors without information are some approachs to improve the performance.
The need of pre-processing are determined by the predictor, tree based methods are insensitive for characteristics of predictors. Linear regressions are not, this chapter will cover unsupervised techniques (doesnt consider the outcome to remove variables) to pre-processing. For example Partial Least Square (PLS) is a flavour of supervised PCA.
How the predictors are encoded, called feature engeneering, can have a significant impact in the performance. For example combining predictors or taking the ratio between predictors (this is art not science). There are techniques to encode the data, in the cahpters \(12-15\) we will study a dataset to predict academic sucess, one piece of information are submission date of grant it could be represented in many ways:
The ‘correct’ feature engeneering depends on several factors. Some models handle predictors in different ways, tree based methods for example partition the data in binnings. For some models multiple encoding of the same predictor could make the model underperforming. The second factor is the relation between the predictor and the outcome, if there are seasonal components maybe the day of the year is the best choise, if some months show more sucess rate then the month encoding is the best. The answer is depends on the model and the true relationship with the outcome.
we will use a dataset to identify the impact of some medicines in cell shape, development status and number of cells. The dataset contains \(2019\) cells, were \(1300\) were judge to be poorly segmented (PS) and \(719\) were well segmented (WS); \(1009\) cells are the trainning set. There are \(116\) features for every cell measured to predict the quality of cell segmentation.
library(AppliedPredictiveModeling)
data(segmentationOriginal)
## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
Some modeling techinques require transformations like all predictors in the same scale. Create a good model could be dificult because some outliers. In these chapter we will discuss centering, scalling and skewess transformations.
To center: Take the average value of the predictor and subtract it from all values as the result the predictor will have a zero mean To scalle: Each value of predictor is divided by its standard deviation as the result the new standard deviation will be \(1\)
These manipulations improve numerical stability and PLS benefits from predictors on the same scalle.
We can found rigth and left skewness in the dataset predictors, it could be evaluated using the following statistic: \[ skewness = \frac{\sum(x_i - \bar{x})^{3}}{(n-1)v^{\frac{3}{2}}} \] where \[ v = \frac{\sum(x_i - \bar{x})^{2}}{(n-1)} \] where the \(x\) is the predictor, \(n\) is the number of values and \(\bar{x}\) is the sample mean. More the distribution skewness to right greater the value of skewness, more the distribution skewned to left smaller the values of skewness.
Simples test to evaluate the degree of skewness are the ratio between the greates over the lower values of the distribution if this value is greater then \(20\) then we have a high skewness.
Box and Cox have found a method to empirically identify the an appropriate transofrmation \[ x^{*} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if}\ \lambda\neq 0 \\ \log (x), & \text{if}\ \lambda = 0 \end{cases} \]
for \(\lambda = 2\) square transformation, \(\lambda = 0.5\) square root, \(\lambda = -1\) inverse and others. Its possible to estimate \(\lambda\) usgin the trainning data and maximum likelihood this procedure can be apply to all predictor with values greather than zero.
The upper picture before and after apply a log transoformation over the predictor data. The bellow picture before and after apply a inverse transformation over the dataset.
Skewness case 2
# Clues for skewness
print(max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3))
[1] 870.8872
# Value of skewness
library(e1071)
print(skewness(segTrainX$VarIntenCh3))
[1] 2.391624
# applying box cox transformations
library(caret)
segPP <- preProcess(segTrainX, method = "BoxCox")
segTrainTrans <- predict(segPP, segTrainX)
print(segPP$bc$VarIntenCh3)
Box-Cox Transformation
1009 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.8693 37.0615 68.1316 101.6718 124.9899 757.0210
Largest/Smallest: 871
Sample Skewness: 2.39
Estimated Lambda: 0.1
With fudge factor, Lambda = 0 will be used for transformations
g1 <- histogram(~segTrainX$VarIntenCh3, xlab = "Natural Units", type = "count")
plot(g1)
g2 <- histogram(~log(segTrainX$VarIntenCh3), xlab = "Log Units", ylab = " ", type = "count")
plot(g2)
print( segPP$bc$PerimCh1)
Box-Cox Transformation
1009 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
47.74 64.37 79.02 91.61 103.24 459.77
Largest/Smallest: 9.63
Sample Skewness: 2.59
Estimated Lambda: -1.1
g3 <- histogram(~segTrainX$PerimCh1, xlab = "Natural Units", type = "count")
plot(g3)
g4 <- histogram(~segTrainTrans$PerimCh1, xlab = "Transformed Data", ylab = " ", type = "count")
plot(g4)
These transformations act on groups of predictors the most importance are methods to resolve outliers and reduce the dimensionality.
Before remove the outliers we need to ensure they arent special cases of data like clients with high salles. Second we need to ensure they aren’t recording errors during data collection step. Some models are resistant to outliers like tre based models. If the model are sensitive to outliers we can apply a spatial sign a procedure wich projects the predictor in a multidimensional sphere, each sample ios divided by its squared norm: \[ x_{ij}^{*} = \frac{ x_{ij} }{ \sqrt{ \sum\limits^{P}_{j=1} x_{ij}^{2}}} \] this transformation requires center and scaled predictor because the denominator measures the squared distance from the center of distribution.
The following picture shows the data before and after apply spatial sign:
Spatial Sign
Data reduction techiniques reduces the number of predictors of the data keeping the majority of information original data. This fewer variables keep the original dataset fidelity, this kind of method wich creates surrogate variables are called feature extraction or signal extraction.
Principal components Analisys (PCA) is a linear combination between the predictors knos as principal components (PC) wich captures the most possible variance. The j th PC can be writen as: \[ PC_j = (a_{j1} \times Predictor1) + (a_{j2} \times Predictor2) + (a_{j3} \times Predictor3) + \ldots + (a_{j4} \times Predictor4) \] where \(P\) is the number of the predictors, the coefficients \(a_{j1},a_{j2},a_{j3},a_{j4}\) are component weigths used to discover the most important predictors to each PC. The following picture shows a dataset before and after apply PCA
PCA
The Figure [PCA] shows two correlated predictors (correl \(= 0.93\)) these two predictors have redundant information, these predictors or a linear combination of them coukld be use in place of these predictors. In the picture we can see the PCA principal components both uncorrelated, the first PC summarize 97% of all the variation.
PCA look for linear combinations of the predictors that maximize the variability it summarizes the predictors with more variation. You need to scale and centralize the data before apply PCA to avoid misconfusion of magnitude order. Besides that its very usefull to remove the skewness of predictors before apply PCA.
PCA is unsupervised method because it doesnt consider the modeling objective or response variable when summarizing the variability. A supervised approach named PLS considers the target variable.
After the pre-processing stage we can finally apply PCA and decide how many principal components to retain, for that we can use the scree plot as bellow:
PCA
The first principal components keep the main variance of the model, analisy the pcs is a important step . To do this, the pcs can ploted against each other and the ploted symbols can be target variables. Some care should be taken while plot this kind of chart whit the scale of values of each component, the last components keep the less amount of variance.
The next figure shows PC against other PC, the colored symbols are the classes.
PC against PC
The percentages of variance are not large for the three first components then we cant over interpret the chart. The plot shows some separation between classes using the PC1 and PC2 but its not true. These chart show the data is not easily separated.
Another use of PCA is to identify which predictors are associated with each PC, each component is a linear combination of the predictors the coefficients are called loading. Loadings close to zero means there are small contribution to the predictor for the PC, the following figure shows the loadings for the first three PC.
PCA
That figure shows to us the first channel have great impact on the firt component because the loadings are on the extreme of the PC1, the majority of loadings for the third channel are closed to zero it means the third channel doesnt contribute with the first component. For the PC3 the channel 3 contributes a lot because the loadings are on the extremes.
#PCA
pr <- prcomp(~ AvgIntenCh1 + EntropyIntenCh1, data = segTrainTrans, scale. = TRUE)
#Chart options
transparentTheme(pchSize = .7, trans = .3)
plot1 <- xyplot(AvgIntenCh1 ~ EntropyIntenCh1,
data = segTrainTrans,
groups = segTrain$Class,
xlab = "Channel 1 Fiber Width",
ylab = "Intensity Entropy Channel 1",
auto.key = list(columns = 2),
type = c("p", "g"),
main = "Original Data",
aspect = 1)
print(plot1)
plot2 <- xyplot(PC2 ~ PC1,
data = as.data.frame(pr$x),
groups = segTrain$Class,
xlab = "Principal Component #1",
ylab = "Principal Component #2",
main = "Transformed",
xlim = extendrange(pr$x),
ylim = extendrange(pr$x),
type = c("p", "g"),
aspect = 1)
print(plot2)
## Apply PCA to the entire set of predictors.
## There are a few predictors with only a single value, so we remove these first
## (since PCA uses variances, which would be zero)
isZV <- apply(segTrainX, 2, function(x) length(unique(x)) == 1)
segTrainX <- segTrainX[, !isZV]
segPP <- preProcess(segTrainX, c("BoxCox", "center", "scale"))
segTrainTrans <- predict(segPP, segTrainX)
segPCA <- prcomp(segTrainTrans, center = TRUE, scale. = TRUE)
## Plot a scatterplot matrix of the first three components
transparentTheme(pchSize = .8, trans = .3)
panelRange <- extendrange(segPCA$x[, 1:3])
splom(as.data.frame(segPCA$x[, 1:3]),
groups = segTrainClass,
type = c("p", "g"),
as.table = TRUE,
auto.key = list(columns = 2),
prepanel.limits = function(x) panelRange)
## Format the rotation values for plotting
segRot <- as.data.frame(segPCA$rotation[, 1:3])
## Derive the channel variable
vars <- rownames(segPCA$rotation)
channel <- rep(NA, length(vars))
channel[grepl("Ch1$", vars)] <- "Channel 1"
channel[grepl("Ch2$", vars)] <- "Channel 2"
channel[grepl("Ch3$", vars)] <- "Channel 3"
channel[grepl("Ch4$", vars)] <- "Channel 4"
segRot$Channel <- channel
segRot <- segRot[complete.cases(segRot),]
segRot$Channel <- factor(as.character(segRot$Channel))
## Plot a scatterplot matrix of the first three rotation variables
transparentTheme(pchSize = .8, trans = .7)
panelRange <- extendrange(segRot[, 1:3])
library(ellipse)
upperp <- function(...)
{
args <- list(...)
circ1 <- ellipse(diag(rep(1, 2)), t = .1)
panel.xyplot(circ1[,1], circ1[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ2 <- ellipse(diag(rep(1, 2)), t = .2)
panel.xyplot(circ2[,1], circ2[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ3 <- ellipse(diag(rep(1, 2)), t = .3)
panel.xyplot(circ3[,1], circ3[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
panel.xyplot(args$x, args$y, groups = args$groups, subscripts = args$subscripts)
}
splom(~segRot[, 1:3],
groups = segRot$Channel,
lower.panel = function(...){}, upper.panel = upperp,
prepanel.limits = function(x) panelRange,
auto.key = list(columns = 2))
Error in grid.Call.graphics(C_downviewport, name$name, strict) :
Viewport 'plot_01.panel.1.1.off.vp' was not found
Some datasets has missing values there are two main types of missing values: i) structurally missing (number of children a man has given birth to); ii) The value was not determined. The most important is to understand why the values are missing, is the pattern of missing data is related with the output? (informative missingness) this type of missing can introduce significant bias at the model. Like a patient using a new drug with side effects, maybe he will not return at the doctors/scientist to keep the study.
There are censored data tipically found in the lab experiments when the censor cant measure something. The value is lower than the ,limit but we cant measure it preciselly. For inference or interpretation is commom to use a random value, interpolation or mean observed value to fill this data. According to author experience missing values are more concentrated at predictors instead of sample in some cases is possible to remove the predictor with high ocurrence of missing values.
For large datasets removing missing samples are not problem if the missingness is not informative. For smaller datasets removing missing values could be problematic because we can lost some patterns or the model cant learn, there are other possible approachs.
Tree based models can handle missing values or we can impute the missing data ussing another predictor. We build a model with trainning set to fill values of predictors using another predictors. If the number of predictors affected by missing is small a EDA could be done to discover the relationship between the predictors.
Its possible to use PCA to determine if are a strong relationship between the predictors, if a variable has missing values and has strong correlation with another predictor that has few missing values, a focused model can often be effective for imputation (like the example bellow)
KNN could be use as imputation technique a new sample is showed to model to find the closest samples of it and averages the nearby to fill the missing values. The advantages: i) the value will be inside the training data values. Disadvantages: i) entire trainning set are need for each new value; ii) Also, the number of neighboors is a tunning parameter as the method to determine closeness of two points.
The bellow figure show two approachs to fill missing data: i) 5NN; ii) linear regression
PCA
The two main advantages to remove some predictors are: i) less computacional time to process; ii) predictors with high correlation are measuring the same information, removing one will not compromisse the performance and simplifies the model; iii) some models can be affected by degenerated distributions of predictors, remove then can improve then model performance.
Tree models are impervious about this poroblems because they will never use this predictor as a split point on the other hand linear regression can be affected. In both cases the predictor will not add ionfoprmation to the model then its better to remove the variables. Near Zero Variance is one famous approach to that, it means variables with small number of values.
How to find this kind of problem? Simple the number of unique points are small relative to the number of samples. Some scientific languages like R and python has packages of near zero variance.
Collinearity is when two predictors have a substancial correlation with each other, multiple predictors can suffer from this to (its called multicollinearity). A heat map with correlation can help to visualize the problem, the variables could be grouped using a cluster algorithm to make it easier to see the high correlation areas.
Greater datasets can use PCA to help to measure the magnitude of the problem, if the first PC accounts for a large percentage of variance this implies at least one group of predictors with represent the same information. PCA loadings can be use to find where are this relationships and which variables are inside it.
There are good reasons to avoid data with high correlated predictors, they add more complexity to the model than information they provide, for linear regression usign correlated features can cause unstable models, numerical errors and degraded predictive performance.
The following picture show a heat map of correlation matrix:
PCA
Classical regression techniques uses Variance Inflation Factor (VIF) to identify multicollinearity but it only works for linear models, but it can determine which variables to remove. A more practical approach is remove the minimum number of predictors to ensure all pairwise relations are bellow a certain threshold, the algorithm is:
1 . Calculate the correl matrix 2 . Determine the two predictors with largest absolute pairwise correl (A and B predictors) 3 . Determine the averga correl between A and other variables, do the same for B 4 . If A has a greater correl remove it, otherwise remove B 5 . Repeat steps 2-4 until no absolute correl is greater than threshold
These methods of feature extraction cant guarantee they will find any relation between the predictors and the outcome.
Numerical/Categorical values can be encoded as dummy variables include dummy in models with intercept could reduce the performance. On the other hand to a tree based model this kind of variable could be a great idea to improve the performance.
The following figure show a classic logistic regression and a logistic regression which has added a squared B predictor
Regressão LogÃstica
There are cases in the literature Fiorina et al (2009) recommend adds complex combination of data to the model, for classification they evaluate class centroids the center of the predictor data for each class. Then for each predictor the distance to each class centroid can be calculated and this distances can be added for the model.
#Get correl matrix
segCorr <- cor(segTrainTrans)
library(corrplot)
corrplot 0.84 loaded
corrplot(segCorr, order = "hclust", tl.cex = .35)
## caret's findCorrelation function is used to identify columns to remove.
highCorr <- findCorrelation(segCorr, .75)
The author of the book dont recommend to manual binning the predictors, the advantage of binning is: i) increase the interpretability of the model and the disadvantage are: i) decrease the performance of the model. Some statiscal models can help the modeler to do that job using statistcal information.
Following the code to pre-process the data and remove some features:
library(AppliedPredictiveModeling)
data(segmentationOriginal)
segData <- subset(segmentationOriginal, Case == "Train")
cellId <- segData$Cell
class <- segData$Class
case <- segData$Case
#removing columns
segData <- segData[, -(1:3)]
#remove status columns
StatusColNum <- grep("Status", names(segData))
segData <- segData[,-StatusColNum]
print(segData)
Following the code to identify skewness:
library("e1071")
# For one predictor
AngleCh1Skewness <- skewness(segData$AngleCh1)
print(AngleCh1Skewness)
[1] -0.02426252
#using apply over all columns
skewnessAll <- apply(segData, 2, skewness)
head(skewnessAll)
AngleCh1 AreaCh1 AvgIntenCh1 AvgIntenCh2 AvgIntenCh3 AvgIntenCh4
-0.02426252 3.52510745 2.95918524 0.84816033 2.20234214 1.90047128
# Using lattice to plot the shape and boxcox to see what kind of transformation could be apply
library("caret")
Ch1AreaTrans <- BoxCoxTrans(segData$AreaCh1)
print(Ch1AreaTrans)
Box-Cox Transformation
1009 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
150.0 194.0 256.0 325.1 376.0 2186.0
Largest/Smallest: 14.6
Sample Skewness: 3.53
Estimated Lambda: -0.9
print("The original data")
[1] "The original data"
head(segData$AreaCh1)
[1] 819 431 298 256 258 358
print("After transformation")
[1] "After transformation"
preditedTransformation <- predict(Ch1AreaTrans, head(segData$AreaCh1))
print(preditedTransformation)
[1] 1.108458 1.106383 1.104520 1.103554 1.103607 1.105523
# Now apply PCA over the dataset
pcaObject <- prcomp(segData,
center = TRUE,
scale = TRUE)
# Cumulative variance for each component
percentVariance <- pcaObject$sd^2 / sum(pcaObject$sd^2) * 100
print(percentVariance[1:3])
[1] 20.91236 17.01330 11.88689
# Transformed values
print(head(pcaObject$x[,1:5]))
PC1 PC2 PC3 PC4 PC5
2 5.0985749 4.5513804 -0.03345155 -2.640339 1.2783212
3 -0.2546261 1.1980326 -1.02059569 -3.731079 0.9994635
4 1.2928941 -1.8639348 -1.25110461 -2.414857 -1.4914838
12 -1.4646613 -1.5658327 0.46962088 -3.388716 -0.3302324
15 -0.8762771 -1.2790055 -1.33794261 -3.516794 0.3936099
16 -0.8615416 -0.3286842 -0.15546723 -2.206636 1.4731658
# PCA loadings
print(head(pcaObject$rotation[,1:3]))
PC1 PC2 PC3
AngleCh1 0.001213758 -0.01284461 0.006816473
AreaCh1 0.229171873 0.16061734 0.089811727
AvgIntenCh1 -0.102708778 0.17971332 0.067696745
AvgIntenCh2 -0.154828672 0.16376018 0.073534399
AvgIntenCh3 -0.058042158 0.11197704 -0.185473286
AvgIntenCh4 -0.117343465 0.21039086 -0.105060977
# caret package has spatial sign: spatialSign(data) to transform data in sphere
# to input missing values you can use: inpute package like impute.knn or bagged tree
# caret package has preProcess function to: transform, center, scale, boxcox, pca variables, and others
# the predict function applies the results to a dataset
trans <- preProcess(segData,
method = c("BoxCox","center","scale","pca"))
print(trans)
Created from 1009 samples and 58 variables
Pre-processing:
- Box-Cox transformation (47)
- centered (58)
- ignored (0)
- principal component signal extraction (58)
- scaled (58)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.00000 -0.50000 -0.10000 0.05106 0.30000 2.00000
PCA needed 19 components to capture 95 percent of the variance
#apply the transformations
transformed <- predict(trans, segData)
print(head(transformed[,1:5]))
To filter near zero var its possible to use caret package function: nearZeroVar
library(caret)
#it returns a vector of integers, each integer is a position to be removed from the dataset
columnsToRemove <- nearZeroVar(segData)
print(columnsToRemove)
integer(0)
#To filter the hogh correlated predictors we can use *cor* function
correlation <- cor(segData)
print(dim(correlation))
[1] 58 58
correlation[1:4, 1:4]
AngleCh1 AreaCh1 AvgIntenCh1 AvgIntenCh2
AngleCh1 1.000000000 -0.002627172 -0.04300776 -0.01944681
AreaCh1 -0.002627172 1.000000000 -0.02529739 -0.15330301
AvgIntenCh1 -0.043007757 -0.025297394 1.00000000 0.52521711
AvgIntenCh2 -0.019446810 -0.153303007 0.52521711 1.00000000
# Show high correlated using an heat map
library(corrplot)
corrplot(correlation, order = "hclust")
#To apply a filter based on correl we can use the function: *findCorrelation*
highCorrel <- findCorrelation(correlation, cutoff = 0.75)
length(highCorrel)
[1] 32
head(highCorrel)
[1] 23 40 43 36 7 15
filteredSegData <- segData[,-highCorrel]
There are much approachs to create dummy variables, one is using formula notation of R. Splits of tree based methods can be increased using only dummy variables, to illustrate we will use a subset of cars dataset
data(cars)
type <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
cars$Type <- factor(apply(cars[, 14:18], 1, function(x) type[which(x == 1)]))
carSubset <- cars[sample(1:nrow(cars), 20), c(1, 2, 19)]
head(carSubset)
levels(carSubset$Type)
[1] "convertible" "coupe" "hatchback" "sedan" "wagon"
# To model price as a function of mileage and type we can use
#dummyVars to find the encodings for the predictors
simpleModel <- dummyVars(~Mileage + Type,
data = carSubset,
##Remove the variable name from the column name
levelsOnly=TRUE)
print(simpleModel)
Dummy Variable Object
Formula: ~Mileage + Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
# Combining the dummyVars object and predict we can create the dummy variables
predict(simpleModel, head(carSubset))
Mileage convertible coupe hatchback sedan wagon
494 21974 0 0 0 1 0
129 24469 0 0 0 1 0
307 21417 0 0 0 1 0
272 24583 0 0 1 0 0
269 23460 0 0 1 0 0
673 26143 0 1 0 0 0
The field type was expanded intpo five categories dummy, the model is simple because we assume the effect of mileage is the same for every type. A more complex model could consider a joint effect called interaction
withInteraction <- dummyVars(~Mileage + Type + Mileage:Type,
data = carSubset,
##Remove the variable name from the column name
levelsOnly=TRUE)
predict(withInteraction, head(carSubset))
Mileage convertible coupe hatchback sedan wagon Mileage:convertible Mileage:coupe Mileage:hatchback Mileage:sedan
494 21974 0 0 0 1 0 0 0 0 21974
129 24469 0 0 0 1 0 0 0 0 24469
307 21417 0 0 0 1 0 0 0 0 21417
272 24583 0 0 1 0 0 0 0 24583 0
269 23460 0 0 1 0 0 0 0 23460 0
673 26143 0 1 0 0 0 0 26143 0 0
Mileage:wagon
494 0
129 0
307 0
272 0
269 0
673 0
The interaction creates more five variables.
library(mlbench)
data(Glass)
str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(reshape2)
meltedGlass <- melt(Glass, id.vars = "Type")
head(meltedGlass)
library(lattice)
plot1 <- densityplot(~value|variable,
data = meltedGlass,
## Adjust each axis so that the measurement scale is
## different for each panel
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
## 'adjust' smooths the curve out
adjust = 1.25,
## change the symbol on the rug for each data point
pch = "|",
xlab = "Predictor")
plot(plot1)
splom(~Glass[, -10], pch = 16, col = rgb(.2, .2, .2, .4), cex = .7)
From Figure 1 we can see K and Mg predictors have a second mode closest to zero and the other predictors (Ca, Ba, Fe, and RI) shows sign of skewness. There are some outliers at K predictor, but it could be due to natural skewness. The predictors Ca, RI, Na and Si have concentrations of samples in the middle of the scale and a small number of data points at the edge of distribution, heavy tailed distribution. Its possible to plot a scatterplot from all the predictors, analisyng the scaterplot we can see some predictors ( Ba, Fe, Mg and K) measurement are zero it creates a mixture distribution of points some with the value and others not. The only correlated predictors are Ca/RI and CaNa, many pairwise distributions have a non standard distribution (i.e. heavy tails or mixtures of distributions). Its difficult to say if the outlier in K predictor is an outlier or an artifact of skewed distribution without sample enough, our best option is to use models resistent to outliers.
How to remove skewness and outliers? We cant use box cox family of transformations because there are zero values in the distributions but we can use Yeo Johnson family of transformations. This family of transformations are similar to box coxz transformations but it can handle zero and negative values. The only significative change was the introduction of the second mode for Ba, Fe the transformation did not improve the data.
yjTrans <- preProcess(Glass[, -10], method = "YeoJohnson")
yjData <- predict(yjTrans, newdata= Glass[, -10])
melted <- melt(yjData)
No id variables; using all as measure variables
plot1 <- densityplot(~value|variable,
data = melted,
## Adjust each axis so that the measurement scale is
## different for each panel
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
## 'adjust' smooths the curve out
adjust = 1.25,
## change the symbol on the rug for each data point
pch = "|",
xlab = "Predictor")
plot(plot1)
centerScale <- preProcess(Glass[, -10], method = c("center", "scale"))
csData <- predict(centerScale, newdata = Glass[, -10])
ssData <- spatialSign(csData)
splom(~ssData, pch = 16, col = rgb(.2, .2, .2, .4), cex = .7)
library(mlbench)
data(Soybean)
print(head(Soybean))
print(str(Soybean))
'data.frame': 683 obs. of 36 variables:
$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
$ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
$ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
NULL
Soybean2 <- Soybean
print(table(Soybean2$temp, useNA = "always"))
0 1 2 <NA>
80 374 199 30
library(car)
Loading required package: carData
Attaching package: ‘car’
The following object is masked from ‘package:ellipse’:
ellipse
Soybean2$temp <- recode(Soybean2$temp,
"0 = 'low'; 1 = 'norm'; 2 = 'high'; NA = 'missing'",
levels = c("low", "norm", "high", "missing"))
print(table(Soybean2$temp))
low norm high missing
80 374 199 30
The date predictor distribution has more data at september than in other months, there is only a single missing value. The precipitation distribution most of data are above average, temperature and precipitation predictors have missing value rates about 5%. Like in the other previous problems we need to analise the joint distribution of the predictors. For categorical predictors joint distribution could be show as contingency table. Its possible to plot like barcharts or mosaic diagrams these kind of distributions.
library(vcd)
## mosaic() can table a table or a formula:
mosaicPlot <- mosaic(~date + temp, data = Soybean2)
plot(mosaicPlot)
#Alternatively, a bar chart can also be used:
barPlot <- barchart(table(Soybean2$date, Soybean2$temp), auto.key = list(columns = 4, title = "temperature"))
plot(barPlot)
The two charts show to us the missing values are not the most frequent ones, the average temperature are the most frequent, september are the month with the most probability to high temperatures (summer at north hemisphere), missing values are the most frequent at July.
Now we will show the distribution per response classfor the missing data. If we look at the frequency of any missing predictor value per class, the results show that some classes are more problematic than others.
Now we will show the percentage of missing values from each predictor:
byPredByClass <- apply(Soybean[, hasMissing], 2,
function(x, y) {
tab <- table(is.na(x), y)
tab[2,]/apply(tab, 2, sum)
},
y = Soybean$Class)
## The columns are predictors and the rows are classes. Let's eliminate
## any rows and columns with no missing values
byPredByClass <- byPredByClass[apply(byPredByClass, 1, sum) > 0,]
byPredByClass <- byPredByClass[, apply(byPredByClass, 2, sum) > 0]
## now print:
t(byPredByClass)
2-4-d-injury cyst-nematode diaporthe-pod-&-stem-blight herbicide-injury phytophthora-rot
date 0.0625 0 0.0 0 0.0000000
plant.stand 1.0000 1 0.4 0 0.0000000
precip 1.0000 1 0.0 1 0.0000000
temp 1.0000 1 0.0 0 0.0000000
hail 1.0000 1 1.0 1 0.7727273
crop.hist 1.0000 0 0.0 0 0.0000000
area.dam 0.0625 0 0.0 0 0.0000000
sever 1.0000 1 1.0 1 0.7727273
seed.tmt 1.0000 1 1.0 1 0.7727273
germ 1.0000 1 0.4 1 0.7727273
plant.growth 1.0000 0 0.0 0 0.0000000
leaf.halo 0.0000 1 1.0 0 0.6250000
leaf.marg 0.0000 1 1.0 0 0.6250000
leaf.size 0.0000 1 1.0 0 0.6250000
leaf.shread 1.0000 1 1.0 0 0.6250000
leaf.malf 0.0000 1 1.0 0 0.6250000
leaf.mild 1.0000 1 1.0 1 0.6250000
stem 1.0000 0 0.0 0 0.0000000
lodging 1.0000 1 1.0 1 0.7727273
stem.cankers 1.0000 1 0.0 1 0.0000000
canker.lesion 1.0000 1 0.0 1 0.0000000
fruiting.bodies 1.0000 1 0.0 1 0.7727273
ext.decay 1.0000 1 0.0 1 0.0000000
mycelium 1.0000 1 0.0 1 0.0000000
int.discolor 1.0000 1 0.0 1 0.0000000
sclerotia 1.0000 1 0.0 1 0.0000000
fruit.pods 1.0000 0 0.0 0 0.7727273
fruit.spots 1.0000 1 0.0 1 0.7727273
seed 1.0000 0 0.0 1 0.7727273
mold.growth 1.0000 0 0.0 1 0.7727273
seed.discolor 1.0000 1 0.0 1 0.7727273
seed.size 1.0000 0 0.0 1 0.7727273
shriveling 1.0000 1 0.0 1 0.7727273
roots 1.0000 0 1.0 0 0.0000000
From this output, we see that there are many predictors completely missing for the 2-4-d-injury, cyst-nematode and herbicide-injury classes. The phytophthora-rot class has a high rate of missing data across many predictors and the diaporthe-pod-&-stem-blight has a more moderate pattern of missing data.
To handle missing data we could use some imputation technique but almost 100% of predictors need to be inputed in some cases, probably it wont work. We can encode missing value as another level or eliminate classes with high rate of missing values.
How would the frequencies of the predictor values aect the modeling process? If we are using a model that is sensitive to sparsity then the low rate of some of the factor levels might be an issue. We can convert the factors to a set of dummy variables and see how good or bad the sparsity is
## Some of the factors are ordinal. First convert them to unordered factors so
## that we get a set of binary indicators.
orderedVars <- unlist(lapply(Soybean, is.ordered))
orderedVars <- names(orderedVars)[orderedVars]
## Let's bypass the problem of missing data by removing the offending classes
completeClasses <- as.character(unique(Soybean$Class[complete.cases(Soybean)]))
Soybean3 <- subset(Soybean, Class %in% completeClasses)
for(i in orderedVars) Soybean3[, i] <- factor(as.character(Soybean3[, i]))
## Use dummyVars to generate the binary predictors...
dummyInfo <- dummyVars(Class ~ ., data = Soybean3)
dummies <- predict(dummyInfo, Soybean3)
variable 'Class' is not a factor
## ... then nearZeroVar to figure out which should be removed.
predDistInfo <- nearZeroVar(dummies, saveMetrics = TRUE)
head(predDistInfo)
## The number and percentage of predictors to remove:
sum(predDistInfo$nzv)
[1] 16
mean(predDistInfo$nzv)
[1] 0.1616162
So if we wanted to remove sparse and unbalanced predictors, 16.2% of the dummy variables would be eliminated. One way around this is to use models that are not sensitive to this characteristic, such as tree or rule based models, or naive Bayes.
library(caret)
data(BloodBrain)
# use ?BloodBrain to see more details
predictorInfo <- nearZeroVar(bbbDescr, saveMetrics = TRUE)
head(predictorInfo)
rownames(predictorInfo)[predictorInfo$nzv]
[1] "negative" "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid" "vsa_acid" "frac.anion7." "alert"
table(bbbDescr$a_acid)
0 2 3
201 6 1
table(bbbDescr$alert)
0 1
206 2
## Let's get rid of these:
filter1 <- bbbDescr[, !predictorInfo$nzv]
ncol(filter1)
[1] 127
It is time consuming to look at individual density plots of 127 predictors
set.seed(532)
sampled1 <- filter1[, sample(1:ncol(filter1), 8)]
names(sampled1)
[1] "o_sp3" "adistm" "peoe_vsa.6" "o_sp2" "peoe_vsa.3" "fpsa3" "frac.cation7." "wpsa2"
Few predictors have skewness, frac.cation7 has two distinct modes. Based on the rug plot of points in the panel for o sp2, these data are also likely to be bimodal
#Determine skewness
library(e1071)
skew <- apply(filter1, 2, skewness)
summary(skew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.1203 0.2670 0.8902 0.8721 1.6176 5.3313
There are a number of predictors that are left{ or right{skewed. We can again apply the Yeo Johnson transformation to the data (some of the predictors are negative):
Although the distributions for fpsa3 and wpsa2 are more symmetric, the other predictors have either additional modes or more pronounced modes. One option would be to manually assess which predictors would benefit from this type of transformation. Is there severe correlation between the predictors? We can plot a correl matrix to visualize it.
The correlation matrix of the predictors can be computed and examined. However, we know that many predictors are skewed in these data. Since the correlation is a function of squared values of the predictors, the samples in the tails of the predictor distributions may have a significant efect on the correlation structure. For this reason, we will look at the correlation structure three ways: the untransformed data, the data after the Yeo Johnson transformation, and the data after a spatial sign transformation.
rawCorr <- cor(filter1)
transCorr <- cor(transformed)
ssData <- spatialSign(scale(filter1))
ssCorr <- cor(ssData)
library(corrplot)
## plot the matrix with no labels or grid
corrplot(rawCorr, order = "hclust", addgrid.col = NA, tl.pos = "n")
corrplot(transCorr, order = "hclust", addgrid.col = NA, tl.pos = "n")
ssData <- spatialSign(scale(filter1))
ssCorr <- cor(ssData)
corrplot(ssCorr, order = "hclust", addgrid.col = NA, tl.pos = "n")
corrInfo <- function(x) summary(x[upper.tri(x)])
corrInfo(rawCorr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00000 -0.16173 0.06434 0.07068 0.28643 1.00000
corrInfo(transCorr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-9.838e-16 -2.933e-16 -3.854e-17 -3.349e-17 2.049e-16 1.274e-15
corrInfo(ssCorr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00000 -0.16062 0.03862 0.05561 0.25545 1.00000
We can remove the high correlated features
thresholds <- seq(.25, .95, by = 0.05)
size <- meanCorr <- rep(NA, length(thresholds))
removals <- vector(mode = "list", length = length(thresholds))
for(i in seq_along(thresholds)){
removals[[i]] <- findCorrelation(rawCorr, thresholds[i])
subMat <- rawCorr[-removals[[i]], -removals[[i]]]
size[i] <- ncol(rawCorr) -length(removals[[i]])
meanCorr[i] <- mean(abs(subMat[upper.tri(subMat)]))
}
corrData <- data.frame(value = c(size, meanCorr),
threshold = c(thresholds, thresholds),
what = rep(c("Predictors",
"Average Absolute Correlation"),
each = length(thresholds)))
Another possible approach is to use subselect package This package uses a diferent criterion to evaluate the quality of a subset and has less greedy methods to search the predictor space. First, we have to remove all linear dependencies from the data. That includes perfect pair{wise correlations as well as relationships between three or more predictors. The trim.matrix function does that:
library(subselect)
ncol(rawCorr)
trimmed <- trim.matrix(rawCorr, tolval=1000*.Machine$double.eps)$trimmedmat
ncol(trimmed)
We can use simulated annealing and genetic algorithms to search for quality subsets. These techniques allow for lower and upper limits for the number of predictors. However, the functions get dramatically slower as the range increases. Here, we will look at one solution found by findCorrelation and, will subsequently use subselect to search within that subset size:
The main diference between these results is that the greedy approach of findCorrelation is much more conservative than the techniques found in the subselect package.